Tools.h++: Internationalization

Written by Nathan Myers, Rogue Wave Software, Inc.

Copyright Rogue Wave Software, Inc. 1993


Editor's note: This paper discusses internationalization in terms of Tools.h++, Rogue Wave's fundamental object library. However, many of the discussions about issues involved in the internationalization of software are generally applicable independent of the product used.


Introduction

Gone are the days when we could ignore our neighbors across the sea (or over the fence), writing software only for local consumption. Professional software development today demands not only awareness of the needs of users in other cultures, but accommodation of those needs. This accommodation is called localization; making software easily localized is called internationalization.

Internationalization actually involves many different activities, potentially as many as the ways in which cultures differ from one another. In practice, it usually means accommodating differences in alphabets, languages, currencies, numbers, and date- and time-keeping notations. Let us consider each of these in turn.

Accommodation of different alphabets begins with allowing them to be represented. A first step in this direction is making code "8-bit clean": it must pass every byte through intact, including those with the high bit set, so that it tolerates extended character sets. Still, eight bits just isn't enough to represent all the character glyphs we use, even in English. Some extension beyond 8 bits is required, and in fact several are in use, falling into two families: multibyte and wide-character encodings.
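One common way for code to fail to be 8-bit clean is to hand a plain char with its high bit set to the <ctype.h> classification functions. The following is a minimal sketch in standard C++ of the usual guard; the helper name isUpper8 is ours, for illustration only, and is not part of Tools.h++.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical helper (not part of Tools.h++): classify a character
// without tripping over bytes whose high bit is set.
bool isUpper8(char c)
{
    // Where plain char is signed, a byte such as 0xE9 is negative.
    // Passing a negative value other than EOF to isupper() is
    // undefined, so convert to unsigned char first.
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}
```

In the default "C" locale only the ASCII letters A to Z are upper case, so a byte such as 0xE9 is simply reported as "not upper case" rather than causing trouble.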

Multibyte encodings use a sequence of one or more bytes to represent a single character. (Typically the ASCII characters are still one byte long.) This gives a compact encoding, but is inconvenient for indexing and substring operations, because the position of the nth character cannot be computed without scanning the string. Wide-character encodings, in contrast, place each character in a 16- or 32-bit integral type called wchar_t, and represent a string as an array of wchar_t. Usually it is possible to translate a string encoded in one form into the other.
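To make the trade-off concrete, here is a sketch of a translation from a multibyte string to a wide one. It assumes UTF-8 as the multibyte encoding, which this paper does not name; on a real system the host's own encoding applies, typically through mbstowcs(). It also uses std::u32string in place of an array of wchar_t so that the result does not depend on the platform's wchar_t width. The function name decodeUtf8 is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 (an assumed multibyte encoding)
// into one 32-bit code point per character. Stray, malformed, or
// truncated sequences become U+FFFD; overlong forms are not rejected
// in this sketch.
std::u32string decodeUtf8(const std::string& bytes)
{
    std::u32string wide;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;      // continuation bytes expected
        char32_t cp = 0;
        bool ok = true;
        if      (lead < 0x80)           { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else                            { ok = false; }  // stray byte

        if (ok && i + extra >= bytes.size()) ok = false; // truncated
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (ok) { wide.push_back(cp);     i += extra + 1; }
        else    { wide.push_back(0xFFFD); i += 1; }
    }
    return wide;
}
```

After decoding, the nth character is simply element n of the result, whereas in the byte string it can be found only by scanning from the start.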

Given any of these representations for strings, there remains much to do. Is a character upper case, lower case, or neither? In a sorted list, where do you put the names that begin with accented letters? What about the Cyrillic names? How are wide-character strings represented on byte streams? These issues are being addressed, in standards bodies and in corporate labs, but the results are not very portable yet. Rogue Wave has no crystal ball, so we simply pass through the semantics your system vendor has provided.

Tools.h++ includes two efficient string types, RWCString and RWWString. RWCString represents 8-bit strings, with some support for multibyte strings. RWWString represents "wide strings", strings of wchar_t. Both provide access to Standard C Library support for local collation conventions with the member function collate() and the global function strXForm(). In addition, the library provides conversions between wide and multibyte representations, via both streams and 8-bit strings. The wide- and multibyte-character encodings used are those of the host system.
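The Standard C Library support referred to here is, we assume, strcoll() and strxfrm(). The sketch below uses strcoll() directly on standard strings to sort a list of names by the current locale's collation rules. It is written in standard C++ rather than against RWCString, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: order two strings by the collation rules of
// the current LC_COLLATE locale. In the default "C" locale this is
// plain byte order, the same as strcmp().
bool collatesBefore(const std::string& a, const std::string& b)
{
    return std::strcoll(a.c_str(), b.c_str()) < 0;
}

// Sort a list of names in place using the current locale's collation.
void sortByLocale(std::vector<std::string>& names)
{
    std::sort(names.begin(), names.end(), collatesBefore);
}
```

In the "C" locale every upper-case ASCII letter sorts before every lower-case one, so "Zoe" precedes "adam". A program that first calls setlocale(LC_COLLATE, "") gets the user's ordering instead; which locale names are installed depends on the host system.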

To accommodate a user's choice of languages, a program must display titles, menu choices, and status messages in that language. Usually such texts are stored in a "message catalog" or "resource file", separate from program code, so they may be easily edited or replaced. Tools.h++ issues no messages; though it does not yet offer much help in this area, it also imposes no policy.

While accounting principles are the same everywhere, the currencies used vary among cultures not only in unit value, but in notation. Indeed, even raw numbers are written differently in different places: in the U.S.A., we would say that 2.345 is less than 2,345, but in much of Europe the reverse is true. In many cases a program must be able to display values in the notations customary to both the vendor and the customer.

Scheduling, which appears in many kinds of software, involves time and calendar calculations. Local versions of the Gregorian calendar vary in their names for the months and the days of the week, and in the order in which the components of a date are written. Notations for the time of day vary as well. Time representations are complicated by time zone conventions, including Daylight Savings Time (DST) rules that vary wildly from place to place, and from year to year in some places.

The Standard C Library provides, with <locale.h>, some facilities to accommodate differences in currency, number, date, and time formats, but it is maddeningly incomplete. It offers no help for conversion from strings to these types, and is practically impossible to use if you must do conversions involving two or more locales. Common time zone facilities (such as those defined in POSIX.1) are similarly limited, usually offering no way to compute wall clock time for other locations, or even for the following year in the same location.
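Part of the difficulty with two or more locales is that localeconv() describes only the single locale currently installed, and returns a pointer to storage that later calls may overwrite. A program that needs several sets of conventions at once must copy out what it wants while each locale is in effect. Here is a minimal sketch of such a copy in standard C++; the NumberFormat type and the function name are ours, for illustration.

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Hypothetical value type: a private copy of a few struct lconv fields.
struct NumberFormat {
    std::string decimalPoint;    // e.g. "." or ","
    std::string thousandsSep;    // may be empty
    std::string currencySymbol;  // may be empty
};

// Copy the fields now: localeconv() returns a pointer to storage that
// later setlocale() or localeconv() calls may overwrite.
NumberFormat captureNumberFormat()
{
    const std::lconv* lc = std::localeconv();
    NumberFormat nf;
    nf.decimalPoint   = lc->decimal_point;
    nf.thousandsSep   = lc->thousands_sep;
    nf.currencySymbol = lc->currency_symbol;
    return nf;
}
```

Called at startup, before any setlocale() call, this captures the "C" locale: a "." decimal point, no digit group separator, and no currency symbol.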

RWLocale and RWZone

Tools.h++ addresses these problems with the abstract classes RWLocale and RWZone. If you have used RWDate you have used RWLocale already, perhaps unknowingly. Every time you convert a date or time to or from a string, a default argument carries along an RWLocale reference. This is a reference to a global instance of the class RWLocaleDefault (derived from RWLocale) which was created at program startup. To use RWLocale explicitly, construct your own instance and pass it in place of the default. Similarly, when you manipulate times, a default RWZone reference is passed along, but you can substitute your own.

You can also install your own instance of RWLocale or RWZone as the global default. You can even install your RWLocale instance in a stream (this is called "imbuing the stream") so that dates and times inserted on (or extracted from) that stream are formatted (or parsed) accordingly, without any special arguments.

Let us look at how all this works, with some examples. Here are the header files we will use:

#include <assert.h>
#include <rw/rstream.h>
#include <rw/cstring.h>
#include <rw/locale.h>
#include <rw/rwdate.h>
#include <rw/rwtime.h>

Begin by constructing a date, today's date:

RWDate today = RWDate::now();

We can display it using ordinary "C"-locale conventions, the usual way:

cout << today << endl;

But what if you are in some other locale? Perhaps you have set your environment variable LANG to "fr", or "fr_FR", because you want French formatting. We would like the date to be displayed in your preferred local format. First, let's construct an RWLocale object:

RWLocale& here = *new RWLocaleSnapshot("");

Class RWLocaleSnapshot is the main implementation of the interface defined by RWLocale. It extracts the information it needs from the global environment during construction with the help of such Standard C Library functions as strftime() and localeconv(). The most straightforward way to use this is to pass it directly to the RWDate member function asString():

cout << today.asString('x', here) << endl;

but there are more convenient ways. We can install here as the global default locale so the insertion operator will use it:

RWLocale::global(&here);
cout << today << endl;

Dates

Now, suppose you also want to format a date in German, but don't want that to be the default. Let us construct a German locale:

RWLocale& german = *new RWLocaleSnapshot("de");  // or, "de_DE"

Now we can format the same date for both local and German readers:

cout << today << endl
     << today.asString('x', german) << endl;

Let us now suppose you want to read in a German date string. The straightforward way, again, is to call everything explicitly:

RWCString str;
cout << "enter a date in German: " << flush;
str.readLine(cin);
today = RWDate(str, german);
if (today.isValid())
  cout << today << endl;

Sometimes you would prefer to use the extraction operator ">>". It must know to expect and parse a German-formatted date. We can pass this information along by imbuing a stream with the German locale.

The following code snippet imbues the stream cin with the German locale, reads in and converts a date string from German, then displays it in the local format.

german.imbue(cin);
cout << "enter a date in German: " << flush;
cin >> today;  // read a German date!
if (today.isValid())
  cout << today << endl;

Imbuing is useful when many values must be inserted or extracted according to a particular locale, or when there is no way to pass a locale argument to the point where it will be needed. By using the static member function RWLocale::of(ios&), your code can discover the locale imbued in a stream. If the stream has not yet been imbued, of() returns the current global locale.
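The "imbued, else global" lookup can be illustrated with the per-stream storage that standard iostreams provide through ios_base::xalloc() and pword(). This is only a sketch of one way such a mechanism could be built, not a description of how Tools.h++ implements it; Formatter, imbueFormatter, and formatterOf are hypothetical names standing in for a locale object, imbue(), and of().

```cpp
#include <cassert>
#include <ios>
#include <sstream>

// Hypothetical stand-in for a locale object.
struct Formatter { const char* name; };

// Global default, used when a stream has nothing attached.
Formatter& globalFormatter()
{
    static Formatter fallback = { "global" };
    return fallback;
}

// One slot index shared by all streams, allocated on first use.
int formatterSlot()
{
    static const int slot = std::ios_base::xalloc();
    return slot;
}

// Attach a formatter to a stream. The stream does not own it; the
// caller must keep it alive for as long as the stream uses it.
void imbueFormatter(std::ios_base& s, Formatter& f)
{
    s.pword(formatterSlot()) = &f;
}

// Return the attached formatter, or the global default if none.
Formatter& formatterOf(std::ios_base& s)
{
    void* p = s.pword(formatterSlot());   // null until something is set
    return p ? *static_cast<Formatter*>(p) : globalFormatter();
}
```

Code deep inside an insertion or extraction operator can then call formatterOf(stream) without any extra argument having been passed down to it.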

The interface defined by RWLocale handles more than dates. It can also convert times, numbers, and monetary values to and from strings. Each has its complications. Time conversions are complicated by the need to identify the time zone of the person who entered, or who will read, the time string. The mishmash of Daylight Savings Time jurisdictions can make this annoyingly difficult. Numbers are somewhat messy to format because the insertion and extraction operators ("<<" and ">>") for them are already defined by <iostream.h>. For money, the main problem is that there is no standard internal representation for monetary values. Fortunately, none of these problems is overwhelming.

Time

Let us consider the time zone problem. Our first observation is that there is no simple relationship between time zones and locales. All of Switzerland shares a single time zone, including DST rules, but has four official languages (French, German, Italian, and Romansch). Hawaii and New York, on the other hand, share a common language but occupy time zones five hours apart; or sometimes six hours apart, because Hawaii does not observe DST. Furthermore, time zone formulas have little to do with cultural formatting preferences. Thus, we use a separate time zone object, rather than letting RWLocale subsume time zone responsibilities.

In Tools.h++, the class RWZone encapsulates knowledge about time zones. It is an abstract class; we have implemented its interface in the class RWZoneSimple. Three instances of RWZoneSimple are constructed at startup, to represent local wall clock time, local Standard time, and Universal time (GMT). Local wall clock time includes any Daylight Savings Time in use. Whenever you convert an absolute time (as in the class RWTime) to or from a string, an instance of RWZone is involved. By default, the local time is assumed, but you can pass a reference to any RWZone instance.

It's time for some examples. Imagine you have scheduled a trip from New York to Paris. You will leave New York on December 20, 1993, at 11:00 PM, and return on March 30, 1994, leaving Paris at 5:00 AM, Paris time. What will the clocks show at your destination when you arrive?

First, let's construct the time zones and the departure times:

RWZoneSimple newYorkZone(RWZone::USEastern, RWZone::NoAm);
RWZoneSimple parisZone  (RWZone::Europe,    RWZone::WeEu);
RWTime leaveNewYork(RWDate(20, 12, 1993), 23,00,00, newYorkZone);
RWTime leaveParis  (RWDate(30,  3, 1994), 05,00,00, parisZone);

The flight is about seven hours long, each way:

RWTime arriveParis  (leaveNewYork + long(7 * 3600));
RWTime arriveNewYork(leaveParis   + long(7 * 3600));

Let's display the Paris arrival time and date in French, and the New York arrival time and date according to local convention:

RWLocaleSnapshot french("fr");  // or "fr_FR"
cout << "Arrive'e a` Paris a` "
     << arriveParis.asString('c', parisZone, french)
     << ", heure locale." << endl;
cout << "Arrive in New York at "
     << arriveNewYork.asString('c', newYorkZone)
     << ", local time." << endl;

This works even though your flight crosses several time zones and arrives on a different day than it departed. Furthermore, on the day of the return trip (in the following year), France has already begun observing Daylight Savings Time, but the U.S. has not. None of these details is visible in the example code above - they are handled silently and invisibly by RWTime and RWZone.
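The arithmetic that RWTime and RWZone perform here can be checked by hand: convert the local departure to Universal time, add the flight time, and convert to the destination's local time. The sketch below does this with UTC offsets hard-coded for the dates in question. These offsets are our assumptions, not values taken from Tools.h++: New York at UTC-5 on both dates, Paris at UTC+1 in December 1993 and at UTC+2 on March 30, 1994. The helper is hypothetical and only handles trips that stay within one month.

```cpp
#include <cassert>

// A wall-clock reading within one month: day of month and hour (0-23).
struct DayHour { int day; int hour; };

// Hypothetical helper: given a local departure, the UTC offsets (in
// hours) at origin and destination, and the flight time in hours,
// return the local arrival. Valid only within a single month.
DayHour arrival(DayHour depart, int originOffset, int destOffset,
                int flightHours)
{
    int utcHours   = depart.day * 24 + depart.hour - originOffset;
    int localHours = utcHours + flightHours + destOffset;
    assert(localHours >= 24);   // has not slipped into the prior month
    DayHour result = { localHours / 24, localHours % 24 };
    return result;
}
```

Under those assumptions, leaving New York on the 20th at 23:00 gives an arrival in Paris on the 21st at 12:00, and leaving Paris on the 30th at 05:00 gives an arrival in New York on the 30th at 05:00, local time.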

All this is easy for places that follow the DST rules that Tools.h++ has built in. (Thus far, these are North America, Western Europe, and "noDST".) What about places that follow other rules, such as Argentina, where spring begins in September and summer ends in March? RWZoneSimple is table-driven; if the rule is simple enough, you can construct your own table (of type RWDaylightRule) and specify it as you construct an RWZoneSimple. For example, imagine that DST begins at 2 AM on the last Sunday in September, and ends at 2 AM on the first Sunday in March. Simply create a static instance of RWDaylightRule:

static RWDaylightRule sudAmerica =
   { 0, 0, TRUE, {8, 4, 0, 120}, {2, 0, 0, 120}};

(See the RWZoneSimple documentation, and <rw/zone.h>, for details on what the numbers mean.) Then construct an RWZone object:

RWZoneSimple  ciudadSud( RWZone::Atlantic, &sudAmerica );

Now you can use ciudadSud just as you used parisZone or newYorkZone above.
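A rule such as "the last Sunday in September" must be resolved to an actual calendar day for each year. The following sketch in standard C++ shows one way to do that. It is not the Tools.h++ implementation, and its conventions (months numbered 1 to 12, weeks 0 to 3 with 4 meaning "last", Sunday as weekday 0) are our own assumptions for illustration; <rw/zone.h> remains the authority on the real table.

```cpp
#include <cassert>

// Day of week for a Gregorian date, 0 = Sunday (Sakamoto's method).
// month is 1..12.
int dayOfWeek(int year, int month, int day)
{
    static const int t[] = { 0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4 };
    if (month < 3) year -= 1;
    return (year + year / 4 - year / 100 + year / 400
            + t[month - 1] + day) % 7;
}

int daysInMonth(int year, int month)
{
    static const int n[] = { 31, 28, 31, 30, 31, 30,
                             31, 31, 30, 31, 30, 31 };
    bool leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    return (month == 2 && leap) ? 29 : n[month - 1];
}

// Resolve "the week-th weekday of the month" to a day of the month.
// week is 0..3 for first..fourth, 4 for last; weekday 0 = Sunday.
int resolveBoundary(int year, int month, int week, int weekday)
{
    assert(month >= 1 && month <= 12 && week >= 0 && week <= 4);
    int first = 1 + (weekday - dayOfWeek(year, month, 1) + 7) % 7;
    int day = first + 7 * week;
    while (day > daysInMonth(year, month))
        day -= 7;               // "last" may be the fourth occurrence
    return day;
}
```

For the imagined rule above, resolveBoundary(1993, 9, 4, 0) yields September 26, 1993, and resolveBoundary(1994, 3, 0, 0) yields March 6, 1994.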

But what about places where the DST rules are too complicated to describe with a simple table, such as Great Britain? There, DST begins on the morning after the third Saturday in April, unless that is Easter, in which case it begins the week prior! For such jurisdictions you might best use Standard time, properly labeled: they are probably used to it. If that just won't do, you can derive from RWZone and implement its interface for Britain alone. This is much easier than trying to make something general enough to handle all possibilities including Britain, and it's smaller and faster besides.

The remaining problem is that there is no standard way to discover what DST rules are in force for any particular place. In this the Standard C Library is no help. Often, however, you can get the user in question to provide the necessary information. One manifestation of this problem is that the local wall clock time RWZone instance is constructed to use North American DST rules, if DST is observed at all. If the user is not in North America, the default local time zone probably performs DST conversions wrong, and you must replace it. For example, for a user in Paris you could say:

RWZone::local(new RWZoneSimple(RWZone::Europe, RWZone::WeEu));

If you look closely into <rw/locale.h>, you will find that RWDate and RWTime are never mentioned. Instead, RWLocale operates on the Standard C Library type struct tm. RWDate and RWTime both provide conversions to this type. In some cases you may find using it directly is preferable to using RWTime::asString().

For example, suppose you must write out a time string containing only hours and minutes (e.g. 12:33). The standard formats defined for strftime() (and implemented by RWLocale as well) don't include that option, but you can fake it. Here's one way:

RWTime now = RWTime::now();
cout << now.hour() << ":" << now.minute() << endl;

Without using various manipulators, this might produce a string like "9:5". Here's another way:

RWTime now = RWTime::now();
cout << now.asString('H') << ":" << now.asString('M') << endl;

This produces "09:05".
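For comparison, the first approach can be made to pad its fields with standard iostream manipulators. This is a sketch in standard C++ on plain integers rather than on RWTime, and the helper name hoursMinutes is hypothetical.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format hours and minutes as "HH:MM",
// zero-padding each field to two digits.
std::string hoursMinutes(unsigned hour, unsigned minute)
{
    std::ostringstream out;
    out << std::setfill('0')
        << std::setw(2) << hour << ':'
        << std::setw(2) << minute;  // setw affects only the next field
    return out.str();
}
```

Because setw() must be repeated before each field, this is wordier than asking for the 'H' and 'M' formats, which pad for you.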

In each of the previous examples, now is disassembled into component parts twice, once to extract the hour and again for the minute. This is an expensive operation. If you expect to work with the components of a time or date much, you may be better off disassembling the time only once:

RWTime now = RWTime::now();
struct tm tmbuf;
now.extract(&tmbuf);
const RWLocale& here = RWLocale::global();  // the default global locale
cout << here.asString(&tmbuf, 'H') << ":"
     << here.asString(&tmbuf, 'M') << endl;
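The same "disassemble once, format many times" pattern can be shown with the Standard C Library alone. The sketch below breaks a time_t down into a struct tm a single time and then formats individual fields from it with strftime(). It assumes a POSIX-style time_t that counts seconds, works in Universal time to stay independent of the local zone, and uses hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>

// Hypothetical helper: format one field of an already broken-down
// time. spec is a single strftime() conversion character, such as
// 'H' or 'M'.
std::string field(const std::tm& parts, char spec)
{
    char fmt[3] = { '%', spec, '\0' };
    char buf[64];
    std::size_t n = std::strftime(buf, sizeof buf, fmt, &parts);
    return std::string(buf, n);
}

// Break the time down once (here in UTC), then reuse the pieces.
std::string hoursMinutesUtc(std::time_t when)
{
    const std::tm* p = std::gmtime(&when);
    assert(p != 0);         // gmtime() returns null if it cannot convert
    std::tm parts = *p;     // copy out of the library's static storage
    return field(parts, 'H') + ":" + field(parts, 'M');
}
```

The expensive step, gmtime(), runs once however many fields are formatted afterwards.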

If you work with times before 1901 or after 2037, RWTime cannot be used, because it does not have the range needed. struct tm operations with RWLocale are not so restricted; you can use RWLocale to perform conversions for any time or date.

Numbers

For numbers, RWLocale provides an interface for conversions between strings and numbers -- both integers and floating point values. RWLocaleSnapshot implements this interface, providing the full range of capabilities defined by the Standard C Library type struct lconv. This includes using appropriate digit group separators, decimal "point", and currency notation. On conversion from strings it allows, and checks, the same digit group separators. Unfortunately, the standard iostream library provides definitions for number insertion and extraction operators which cannot be overridden, so stream operations are clumsier than we might like.

Instead, we use RWCString functions directly:

RWLocaleSnapshot french("fr");
double f = 1234567.89;
long i = 987654;
RWCString fs = french.asString(f, 2);
RWCString is = french.asString(i);
if (french.stringToNum(fs, &f) &&
    french.stringToNum(is, &i))  // verify conversion
  cout << f << "\t" << i << endl
       << fs << "\t" << is << endl;

The French use "," for the decimal point, and "." for the digit group separator, so this might display:

1.23457e+06	987654
1.234.567,89	987.654

Numbers with digit group separators are certainly easier to read.
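For readers without Tools.h++ at hand, the same separators can be demonstrated with the locale facets of standard C++, which are a different mechanism from RWLocale. The sketch below defines its own punctuation facet instead of naming a system locale such as "fr", so it does not depend on which locales are installed. The class and function names are hypothetical.

```cpp
#include <cassert>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: "," as decimal point, "." between groups of three.
class ContinentalPunct : public std::numpunct<char> {
protected:
    char do_decimal_point() const { return ','; }
    char do_thousands_sep() const { return '.'; }
    std::string do_grouping() const { return "\3"; }
};

// Format a value with a fixed number of decimal places.
std::string formatContinental(double value, int places)
{
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it.
    out.imbue(std::locale(std::locale::classic(), new ContinentalPunct));
    out << std::fixed << std::setprecision(places) << value;
    return out.str();
}

// Parse such a string; returns false if extraction fails.
bool parseContinental(const std::string& text, double& value)
{
    std::istringstream in(text);
    in.imbue(std::locale(std::locale::classic(), new ContinentalPunct));
    in >> value;
    return !in.fail();
}
```

With this facet, formatContinental(1234567.89, 2) produces "1.234.567,89", and parseContinental() reads the same text back.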

Currency

Currency conversions are trickier, mainly because there is no standard way to represent monetary values in a computer. We have adopted the convention that such values represent an integral number of the smallest unit of currency in use. For example, in the U.S., to represent the balance "$10.00", you might say:

double sawbuck = 1000.;

This representation has the advantages of wide range, exactness, and portability. By wide range, we mean that it can exactly represent values from $0.00 up to (and beyond) $10,000,000,000,000.00. This is larger than any likely budget. By exactness, we mean that, by representing monetary values without fractional parts, you can perform arithmetic on them and compare the results for equality:

double price = 999.;		// $9.99
double penny = 1.;		//  $.01
assert(price + penny == sawbuck);

This would not be possible if the values were naively represented, as for instance "price = 9.99;".
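The hazard of fractional values is easiest to see with amounts that binary floating point cannot represent exactly. The sketch below contrasts ten cents plus twenty cents held as whole cents with the same sum held as fractions of a dollar. It assumes the host uses IEEE 754 doubles, in which whole numbers up to 2^53 are exact; the names are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: does a + b compare equal to expected?
bool sumsExactly(double a, double b, double expected)
{
    return a + b == expected;
}

// Largest range of whole cents that is exact in an IEEE 754 double
// (an assumption about the host's floating-point format): 2^53.
const double maxExactCents = 9007199254740992.0;

void demonstrate()
{
    assert( sumsExactly(10., 20., 30.));     // whole cents: exact
    assert(!sumsExactly(0.10, 0.20, 0.30));  // dollars: rounding error
}
```

The limit of 2^53 cents is a little over ninety trillion dollars, comfortably beyond the ten trillion quoted above.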

By portability, we mean simply that double is a standard type, unlike common 64-bit integer or BCD representations. Of course, financial calculations may still be performed on such other representations, but because it is always possible to convert between them and double, this supports everyone. In the future RWLocale may directly support some other common representations as well.

Let us consider some examples of currency conversions:

const RWLocale& here = RWLocale::global();
double sawbuck = 1000.;
RWCString tenNone  = here.moneyAsString(sawbuck, RWLocale::NONE);
RWCString tenLocal = here.moneyAsString(sawbuck, RWLocale::LOCAL);
RWCString tenIntl  = here.moneyAsString(sawbuck, RWLocale::INTL);
if (here.stringToMoney(tenNone,  &sawbuck) &&
    here.stringToMoney(tenLocal, &sawbuck) &&
    here.stringToMoney(tenIntl,  &sawbuck))  // verify conversion
  cout << sawbuck  << "  " << tenNone << "  "
       << tenLocal << "  " << tenIntl << "  " << endl;

In a U.S. locale, this displays:

1000.00000  10.00  $10.00  USD 10.00
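To show what the whole-cents convention implies for formatting, here is a minimal sketch in standard C++ that renders such a value with a caller-supplied prefix. It is not how moneyAsString() is implemented; it handles only non-negative amounts with two decimal places, applies no digit group separators, and its name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: render a non-negative amount, held as a whole
// number of cents in a double, as "<prefix>units.cc".
std::string centsAsString(double cents, const std::string& prefix)
{
    assert(cents >= 0 && cents == std::floor(cents)); // whole cents only
    long long whole = static_cast<long long>(cents);
    std::ostringstream out;
    out << prefix << (whole / 100) << '.'
        << std::setfill('0') << std::setw(2) << (whole % 100);
    return out.str();
}
```

With sawbuck set to 1000., the prefixes "", "$", and "USD " give "10.00", "$10.00", and "USD 10.00" respectively.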

Wrap up

We have covered lots of territory - alphabets, languages, dates, times, time zones, numbers, money - and yet have only scratched the surface of what can be done by combining these facilities. Internationalization is a brave new world for software engineering, but with the proper tools it can be more exciting than distressing.


© Copyright 1995, Rogue Wave Software, Inc.